Abstract

Non-convex constraints have recently proven a valuable tool in many 
optimisation problems. In particular sparsity constraints have had a sig- 
nificant impact on sampling theory, where they are used in Compressed 
Sensing and allow structured signals to be sampled far below the rate tra- 
ditionally prescribed. 

Nearly all of the theory developed for Compressed Sensing signal re- 
covery assumes that samples are taken using linear measurements. In this 
paper we instead address the Compressed Sensing recovery problem in a 
setting where the observations are non-linear. We show that, under con- 
ditions similar to those required in the linear setting, the Iterative Hard 
Thresholding algorithm can be used to accurately recover sparse or struc- 
tured signals from few non-linear observations. 

Similar ideas can also be developed in a more general non-linear opti- 
misation framework. In the second part of this paper we therefore present 
related results that show how this can be done under sparsity and union of
subspaces constraints, whenever a generalisation of the Restricted Isometry 
Property traditionally imposed on the Compressed Sensing system holds. 

Key words and phrases : Compressed Sensing, Nonlinear Optimisation, 
Non-Convex Constraints, Inverse Problems 



1 Introduction 

Compressed Sensing [1, 2, 3] deals with the acquisition of finite dimensional
sparse signals. Let x be a sparse vector of length N and assume we sample 
x using M linear measurements. The M samples can then be collected into a 
vector y of length M and the sampling process can be described by a matrix
$\Phi$. If the observations are noisy, then the Compressed Sensing observation model is

$$y = \Phi x + e, \tag{1}$$

where e is the noise vector. If $M < N$, then such a linear system is not uniquely
invertible in general, unless we use additional assumptions on x. Sparsity of x
is such an assumption and Compressed Sensing theory tells us that, for certain
$\Phi$, we can recover x from y even if $M \ll N$, given that x has roughly $O(M)$
non-zero elements. However, in general, recovery of x is a combinatorial problem
which is known to be NP-hard. Fortunately, under stricter conditions on $\Phi$, a
range of different polynomial time algorithms can be used to recover x whenever
x has roughly $O(M/\log(N))$ non-zero elements.

One of the conditions that guarantees that we can use efficient algorithms 
is the Restricted Isometry Property (RIP). A matrix $\Phi$ satisfies the Restricted
Isometry Property of order 2k pQ if 

$$(1 - \delta)\|x_1 + x_2\|_2^2 \le \|\Phi(x_1 + x_2)\|_2^2 \le (1 + \delta)\|x_1 + x_2\|_2^2 \tag{2}$$

for all k-sparse $x_1$ and $x_2$. The Restricted Isometry Constant $\delta$ is defined as the
smallest constant for which this property holds. One important interpretation
of the RIP is in terms of the Lipschitz property of $\Phi$ and its inverse (where the
inverse is defined only for sparse vectors and their image under $\Phi$) [14]: the
condition states that not only is $\Phi$ invertible on the set of sparse signals, but this
inverse is also smooth.
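As a rough numerical illustration of inequality (2), the following sketch samples random pairs of k-sparse vectors and records the ratio $\|\Phi(x_1 + x_2)\|_2^2 / \|x_1 + x_2\|_2^2$. The Gaussian matrix, the problem sizes and the helper names are assumptions made purely for illustration. Sampling can only give a lower bound on the Restricted Isometry Constant, since the true constant is a supremum over all sparse pairs.

```python
import numpy as np

def random_sparse(n, k, rng):
    """Draw a k-sparse vector with Gaussian non-zero entries (illustrative helper)."""
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    x[support] = rng.standard_normal(k)
    return x

def empirical_isometry_ratios(phi, k, trials, rng):
    """Return the min and max of ||phi v||^2 / ||v||^2 over sampled v = x1 + x2."""
    n = phi.shape[1]
    ratios = []
    for _ in range(trials):
        v = random_sparse(n, k, rng) + random_sparse(n, k, rng)
        ratios.append(np.sum((phi @ v) ** 2) / np.sum(v ** 2))
    return min(ratios), max(ratios)

rng = np.random.default_rng(0)
m, n, k = 64, 256, 4                              # assumed problem sizes
phi = rng.standard_normal((m, n)) / np.sqrt(m)    # columns have unit expected norm
lo, hi = empirical_isometry_ratios(phi, k, trials=2000, rng=rng)
delta_lower_bound = max(1.0 - lo, hi - 1.0)       # sampled lower bound on delta
print(lo, hi, delta_lower_bound)
```

The sampled ratios always lie between the squared smallest and largest singular values of the matrix, which is a much cruder bound than the restricted one the RIP asks for.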

The RIP condition is a sufficient condition for the recovery of sparse x. For
example, [4] has shown that, for any x, given an observation $y = \Phi x + e$, where
$\Phi$ has the Restricted Isometry Property with $\delta < \sqrt{2} - 1$, the solution $x^\star$
to the convex optimisation problem

$$\min_{x} \|x\|_1 : \|y - \Phi x\|_2 \le \|e\|_2 \tag{3}$$

has an error bounded by

$$\|x^\star - x\| \le c k^{-0.5}\|x - x_k\|_1 + d\|e\|, \tag{4}$$

where $\|\cdot\|_1$ is the vector 1 norm, $x_k$ is the best k term approximation to x and
where c and d are two constants depending only on $\delta$.

Similar results have been obtained for other algorithms, such as the Compressed
Sampling Matching Pursuit (CoSaMP) and Subspace Pursuit (SP) algorithms
[5, 6] and the Iterative Hard Thresholding (IHT) algorithm [7].

Several generalisations to this now classical Compressed Sensing setup have 
been introduced over the years. For example, some of the recovery algorithms, 
such as CoSaMP, SP and IHT, can be adapted to allow signals x to lie in a 
much more general, non-convex constraint set $\mathcal{A}$. A powerful model here is, for
example, the Union of Subspaces model, in which x is assumed to lie on one of
several linear subspaces $\mathcal{A}_i$, though it is not known a priori on which subspace we
are to look. Not only does this framework include the standard sparse model as
a special instance, many other models of interest, such as analogue Compressed
Sensing methods [8], low rank matrix models [9], or structured sparse models
[10], are also covered.

In this more general setting, with a general non-convex constraint set $\mathcal{A}$,
Compressed Sensing can be formulated as the following optimisation problem,

$$\arg\min_{x \in \mathcal{A}} \|y - \Phi x\|_2^2, \tag{5}$$

that is, we search for a vector x from the non-convex constraint set $\mathcal{A}$ that minimises
the sum of squares observation error.

In this paper we look at a much more general setting, where we want to find
the following optimum,

$$\arg\min_{x \in \mathcal{A}} f(x), \tag{6}$$

where $f(x)$ is now a much more general non-linear function of x.

Of particular interest to us are non-linear Compressed Sensing problems
where $f(x) = \|y - \Phi(x)\|$, with $\Phi(x)$ being a non-linear mapping from one vector
space to another. We address this non-linear Compressed Sensing problem first;
however, the more general problem in equation (6) is of independent interest
and an alternative treatment will be presented in the second part of this paper.

When we started studying these problems, not much was known about this general
setting. However, since the first draft of this paper [11], similar ideas have
been put forward independently in [12], where the non-linear Compressed Sensing
problem was tackled using a convexification approach, and in [13], where
non-convex optimisation problems were studied using an alternative greedy approach
to the one discussed here. Whilst the first part of this paper contains
more recent results, the second part of this paper is basically the same material
that can be found in the earlier draft of this paper [11].

2 Non-Linear Compressed Sensing

We are here interested in developing a better understanding of what
happens to the Compressed Sensing recovery problem when a signal is measured
with some non-linear system. In particular, the hope is that, if the system is not
too non-linear, then recovery should still be possible under similar assumptions
to those made in linear Compressed Sensing. To see the intuition behind why
this might work, it is worth pointing out that in the linear setting, Compressed
Sensing recovery works exactly in those cases in which the observation system
is a bi-Lipschitz embedding. This means that both the observation mapping
itself and its inverse are Lipschitz functions. Obviously, these functions
are only Lipschitz on the constraint set $\mathcal{A}$ and its image $\Phi\mathcal{A}$. In the linear
setting, if $\Phi$ is bounded (e.g. in finite dimensional spaces), then $\Phi$ itself is
obviously Lipschitz. The idea is now that, if Compressed Sensing works if both
forward and backward maps are Lipschitz, maybe we can move away from a
linear setting, where $\Phi$ is linear, and instead assume $\Phi$ to be Lipschitz, but
non-linear.

The study of non-linear observation systems is not only of academic interest
but has important implications for many real-world sampling systems, where
the measurement system often cannot be designed to be perfectly linear. Assume
therefore that our measurements are described by a non-linear mapping $\Phi(\cdot)$ that
maps elements of the normed vector space $\mathcal{H}$ into the normed vector space $B$.
The observation model is therefore

$$y = \Phi(x) + e, \tag{7}$$

where $e \in B$ is an unknown but bounded error term. Both $\mathcal{H}$ and $B$ are assumed
to be Hilbert spaces.

2.1 The Constraints 

As in Compressed Sensing, the interesting case occurs whenever the sampling
system $\Phi$ is non-invertible or ill-conditioned. To cope with this, additional
constraints need to be imposed on x. Again, in the interest of generality, instead
of restricting our discussion to sparse signals (however these might be defined
in general Hilbert spaces) we here use the more general framework of [2] and
assume that x lies in or close to a known set $\mathcal{A}$, where $\mathcal{A} \subset \mathcal{H}$ is a non-convex
subset of $\mathcal{H}$. Of particular interest will be constraint sets $\mathcal{A}$ that can be described
as the union of several subspaces. For these models we can write

$$\mathcal{A} = \bigcup_i \mathcal{A}_i, \tag{8}$$

where we use arbitrary closed subspaces $\mathcal{A}_i \subset \mathcal{H}$.

One approach to recover x from y would be to mirror Compressed Sensing
ideas and to define a convex objective function which can then be optimised using
standard tools. However, for our general setup, it is not clear how this could
be done. Instead, we use the Iterative Hard Thresholding (IHT) algorithm. To
define this for general constraint sets $\mathcal{A}$, we again replace the hard thresholding
step with a more general map which can be understood as a form of projection
[14]. Let $P_{\mathcal{A}}$ be a map from $\mathcal{H}$ to $\mathcal{A}$ such that

$$x_{\mathcal{A}} = P_{\mathcal{A}}(x) : \quad x_{\mathcal{A}} \in \mathcal{A}, \quad \|x - x_{\mathcal{A}}\|_2 \le \inf_{\tilde{x} \in \mathcal{A}} \|x - \tilde{x}\|_2 + \epsilon. \tag{9}$$

In this definition we have introduced an arbitrarily small constant $\epsilon > 0$, as there
might not exist an $x_{opt} \in \mathcal{A}$ such that $\|x - x_{opt}\|_2 = \inf_{\tilde{x} \in \mathcal{A}} \|x - \tilde{x}\|_2$. However, for
simplicity, we will assume for the rest of this paper that $\mathcal{A}$ is a so-called proximal
set, which simply means that the required optimal points indeed
lie in the set $\mathcal{A}$, so that we use $\epsilon = 0$ here. Nevertheless, it is easy to adapt our
theory to the more general setting.

Note that this "projection" might not be defined uniquely in general, as for
a given x, there might be several elements $x_{\mathcal{A}}$ that satisfy the condition in (9).
However, all we require here is that the map $P_{\mathcal{A}}(x)$ returns a single element
from the set of admissible $x_{\mathcal{A}}$ (which is guaranteed to be non-empty [14]). How
this selection is done is of no consequence for our arguments here.

It is further worth noting that the relaxation offered by an $\epsilon > 0$ in the
definition of the above projection also has a computational advantage. Instead
of having to compute exact optima, which for many problems are often difficult
to find, many approximate algorithms can be used instead (see [15] for a more
detailed discussion).
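For the standard k-sparse model, where $\mathcal{A}$ is the union of all k-dimensional coordinate subspaces, the map in (9) with $\epsilon = 0$ can be realised by hard thresholding. The sketch below is an illustration under that assumption only; the function names are hypothetical, and the brute-force search over supports is included merely to confirm on a tiny example that keeping the k largest magnitudes attains the minimal distance.

```python
import itertools
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and set the rest to zero."""
    out = np.zeros_like(x)
    if k > 0:
        keep = np.argsort(np.abs(x))[-k:]   # ties are broken arbitrarily
        out[keep] = x[keep]
    return out

def brute_force_projection_error(x, k):
    """Smallest distance from x to any k-dimensional coordinate subspace."""
    best = np.inf
    for support in itertools.combinations(range(x.size), k):
        candidate = np.zeros_like(x)
        candidate[list(support)] = x[list(support)]
        best = min(best, np.linalg.norm(x - candidate))
    return best

x = np.array([0.3, -2.0, 0.1, 1.5, -0.7])
x_a = hard_threshold(x, 2)
print(x_a, np.linalg.norm(x - x_a), brute_force_projection_error(x, 2))
```

When several entries share the same magnitude the selection is not unique, which mirrors the remark above that any admissible element may be returned.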

2.2 The Iterative Hard Thresholding Algorithm for Non-Linear 
Compressed Sensing 

For the linear Compressed Sensing problem, the Iterative Hard Thresholding 
(IHT) algorithm uses the following iteration 

$$x^{n+1} = P_{\mathcal{A}}(x^n + \mu\Phi^*(y - \Phi x^n)), \tag{10}$$

where $\Phi$ is the linear measurement operator, $\Phi^*$ is its adjoint and $\mu$ is a step size.
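A minimal sketch of iteration (10) for real matrices and the k-sparse model is given below. The random Gaussian matrix, the noise-free data, the iteration count and the conservative step size $\mu = 1/\|\Phi\|^2$ (spectral norm) are all assumptions chosen for illustration; they are not the conditions analysed in this paper.

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and set the rest to zero."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]
    out[keep] = x[keep]
    return out

def iht_linear(phi, y, k, mu, iterations):
    """Iterate x <- P_A(x + mu * phi^T (y - phi x)) starting from x = 0."""
    x = np.zeros(phi.shape[1])
    for _ in range(iterations):
        x = hard_threshold(x + mu * (phi.T @ (y - phi @ x)), k)
    return x

rng = np.random.default_rng(1)
m, n, k = 80, 200, 5                       # assumed problem sizes
phi = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
y = phi @ x_true                           # noise-free observations for simplicity

mu = 1.0 / np.linalg.norm(phi, 2) ** 2     # conservative step: 1/mu equals ||phi||^2
x_hat = iht_linear(phi, y, k, mu, iterations=500)
print(np.linalg.norm(y - phi @ x_hat), np.linalg.norm(x_hat - x_true))
```

With this step size the observation error $\|y - \Phi x^n\|$ cannot increase from one iteration to the next, which makes the sketch easy to sanity check; larger step sizes usually converge faster but need a restricted condition on $\Phi$.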

In the non-linear case, let us approximate $\Phi(x)$ using an affine Taylor series
type approximation around a point $x^\star$, so that $\Phi(x) \approx \Phi(x^\star) + \Phi_{x^\star}(x - x^\star)$,
where $\Phi_{x^\star}$ is a linear operator (such as the Jacobian of $\Phi(x)$, evaluated at the point
$x^\star$). The matrix $\Phi_{x^\star}$ thus will depend on $x^\star$ in general. At iteration n we then
write the IHT algorithm as

$$x^{n+1} = P_{\mathcal{A}}(x^n + \mu\Phi_{x^n}^*(y - \Phi(x^n))). \tag{11}$$

Indeed, as we show below in Section 2.4, this algorithm can recover x under similar
conditions to those required for the IHT algorithm in the linear setting. All
we require is that the matrices $\Phi_{x^\star}$ satisfy a Restricted Isometry Property and
that the error introduced in the linearisation is not too large, i.e. that
$\|\Phi(x_{\mathcal{A}}) - \Phi(x^n) - \Phi_{x^n}(x_{\mathcal{A}} - x^n)\|$ is small for large n.
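Iteration (11) can be sketched in the same way as the linear algorithm, with the Jacobian evaluated at the current iterate taking the place of the fixed matrix. In the sketch below the forward map (a Gaussian matrix applied to $x + 0.1\tanh(x)$), the step size and all names are assumptions made for illustration, and no claim is made that this particular setup meets the conditions of the theory.

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and set the rest to zero."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]
    out[keep] = x[keep]
    return out

def iht_nonlinear(forward, jacobian, y, n, k, mu, iterations):
    """Iterate x <- P_A(x + mu * J(x)^T (y - forward(x))) starting from x = 0."""
    x = np.zeros(n)
    for _ in range(iterations):
        residual = y - forward(x)
        x = hard_threshold(x + mu * (jacobian(x).T @ residual), k)
    return x

rng = np.random.default_rng(2)
m, n, k = 80, 200, 5                       # assumed problem sizes
a_mat = rng.standard_normal((m, n)) / np.sqrt(m)

def forward(x):
    # Assumed mildly non-linear sensor: matrix applied to x + 0.1 * tanh(x).
    return a_mat @ (x + 0.1 * np.tanh(x))

def jacobian(x):
    # Jacobian of the map above: column j of a_mat scaled by 1 + 0.1 / cosh(x_j)^2.
    return a_mat * (1.0 + 0.1 / np.cosh(x) ** 2)

x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
y = forward(x_true)                        # noise-free observations for simplicity

mu = 1.0 / (1.1 * np.linalg.norm(a_mat, 2)) ** 2   # conservative step size
x_hat = iht_nonlinear(forward, jacobian, y, n, k, mu, iterations=500)
print(np.linalg.norm(x_hat - x_true))
```

Passing `forward` and `jacobian` as callables keeps the iteration itself independent of the particular measurement system.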

Theorem 1. Assume that $y = \Phi(x) + e$ and that $\Phi_{x^\star}$ is a linearisation of
$\Phi(\cdot)$ at $x^\star$ so that the Iterative Hard Thresholding algorithm uses the iteration
$x^{n+1} = P_{\mathcal{A}}(x^n + \mu\Phi_{x^n}^*(y - \Phi(x^n)))$. Assume that $\Phi_{x^\star}$ satisfies the RIP

$$\alpha\|x_1 - x_2\|_2^2 \le \|\Phi_{x^\star}(x_1 - x_2)\|_2^2 \le \beta\|x_1 - x_2\|_2^2 \tag{12}$$

for all $x_1, x_2, x^\star \in \mathcal{A}$, with constants satisfying $\beta \le 1/\mu < 1.5\alpha$. Define

$$e_{\mathcal{A}}^n = y - \Phi(x^n) - \Phi_{x^n}(x_{\mathcal{A}} - x^n) \tag{13}$$

and $\epsilon^k = b\sum_{n=0}^{k-1} a^{k-1-n}\|e_{\mathcal{A}}^n\|_2^2$, where $b = 4/\alpha$ and $a = 2/(\mu\alpha) - 2$, then after

$$k^\star = 2\,\frac{\ln\left(\delta\sqrt{\epsilon^k}/\|x_{\mathcal{A}}\|\right)}{\ln(2/(\mu\alpha) - 2)} \tag{14}$$

iterations we have

$$\|x - x^{k^\star}\| \le (1 + \delta)\sqrt{\epsilon^k} + \|x_{\mathcal{A}} - x\|. \tag{15}$$

Obviously, for the above theorem to make sense, we would require the error
term $\epsilon^k$ to be well behaved. This is true whenever $\|y - \Phi(x^n) - \Phi_{x^n}(x_{\mathcal{A}} - x^n)\|_2^2$
is bounded, as then $\epsilon^k \le b\sum_{n=0}^{k-1} a^{k-1-n} C$ for some constant C, so that the
requirement that $1/\mu < 1.5\alpha$ ensures that $a = 2/(\mu\alpha) - 2 < 1$, which in turn
implies that the geometric series $\sum_{n=0}^{k-1} a^{k-1-n}$ is bounded.
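The role of the constants in Theorem 1 can be checked numerically. The sketch below computes $a$ and $b$ from assumed values of $\alpha$, $\beta$ and $\mu$, evaluates $\epsilon^k$ for a constant squared linearisation error, and compares it with the geometric series limit $bC/(1-a)$. The numerical values and function names are illustrative assumptions, not values taken from the paper.

```python
def theorem1_constants(alpha, beta, mu):
    """Return (a, b) after checking the step size condition beta <= 1/mu < 1.5 alpha."""
    if not (beta <= 1.0 / mu < 1.5 * alpha):
        raise ValueError("step size violates beta <= 1/mu < 1.5 * alpha")
    a = 2.0 / (mu * alpha) - 2.0
    b = 4.0 / alpha
    return a, b

def epsilon_k(squared_errors, a, b):
    """Evaluate b * sum_{n=0}^{k-1} a^(k-1-n) * squared_errors[n]."""
    k = len(squared_errors)
    return b * sum(a ** (k - 1 - n) * e for n, e in enumerate(squared_errors))

alpha, beta, mu = 0.9, 1.1, 0.9            # assumed RIP constants and step size
a, b = theorem1_constants(alpha, beta, mu)
bound_c = 0.01                             # assumed bound C on the squared error norms
eps = epsilon_k([bound_c] * 50, a, b)
limit = b * bound_c / (1.0 - a)            # value of the geometric series bound
print(a, b, eps, limit)
```

Because $a < 1$ here, `eps` stays below `limit` however many iterations are used, which is the boundedness argument made above.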



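Indeed, if $a < 1$ and if we can show that $\|e_{\mathcal{A}}^n\|_2^2$ is bounded and
convergent to some $\epsilon_{\lim}$, then $\epsilon^k$ will also be bounded, as the following argument
shows. Write $\bar{\epsilon} = \sup_n \|e_{\mathcal{A}}^n\|_2^2$ and $\epsilon_p = \sup_{n \ge p}\|e_{\mathcal{A}}^n\|_2^2$. Then, for any $p < k$,

$$\begin{aligned}
\epsilon^k/b &= \sum_{n=0}^{k-1} a^{k-n-1}\|e_{\mathcal{A}}^n\|_2^2 \\
&= \sum_{n=0}^{p-1} a^{k-n-1}\|e_{\mathcal{A}}^n\|_2^2 + \sum_{n=p}^{k-1} a^{k-n-1}\|e_{\mathcal{A}}^n\|_2^2 \\
&\le \sum_{n=0}^{p-1} a^{k-n-1}\|e_{\mathcal{A}}^n\|_2^2 + \epsilon_p\sum_{n=p}^{k-1} a^{k-n-1} \\
&\le \sum_{n=0}^{p-1} a^{k-n-1}\|e_{\mathcal{A}}^n\|_2^2 + \epsilon_p\frac{1}{1-a} \\
&\le \bar{\epsilon}\sum_{n=0}^{p-1} a^{k-n-1} + \epsilon_p\frac{1}{1-a} \\
&\le \frac{a^{k-p}\,\bar{\epsilon}}{1-a} + \frac{\epsilon_p}{1-a}.
\end{aligned} \tag{16}$$

Thus, if we let k and p increase to infinity such that $k - p \to \infty$, then the
first term on the right converges to zero whilst the second term converges to a
limit depending on $\epsilon_{\lim}$, so that, if we iterate the algorithm long enough, then

$$\lim_{k \to \infty} \epsilon^k \le \epsilon_{\lim}\frac{b}{1-a} \tag{17}$$

and the error bound converges to

$$\|x - x^\star\| \le \sqrt{\epsilon_{\lim}\frac{b}{1-a}} + \|x_{\mathcal{A}} - x\|. \tag{18}$$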

Actually, as shown in Section 2.5, more can be said if we can establish the following
bound for $\Phi(x)$ and its linearisation: $\|\Phi(x_1) - \Phi(x_2) - \Phi_{x_1}(x_1 - x_2)\|_2^2 \le C\|x_1 - x_2\|_2^2$.

Corollary 2. Assume that $y = \Phi(x) + e$ and that $\Phi_{x^\star}$ is a linearisation of
$\Phi(\cdot)$ at $x^\star$ so that the Iterative Hard Thresholding algorithm uses the iteration
$x^{n+1} = P_{\mathcal{A}}(x^n + \mu\Phi_{x^n}^*(y - \Phi(x^n)))$. Assume that $\Phi_{x^\star}$ satisfies the RIP

$$\alpha\|x_1 - x_2\|_2^2 \le \|\Phi_{x^\star}(x_1 - x_2)\|_2^2 \le \beta\|x_1 - x_2\|_2^2 \tag{19}$$

for all $x_1, x_2, x^\star \in \mathcal{A}$, and assume $\Phi(x)$ and $\Phi_{x}$ satisfy

$$\|\Phi(x_1) - \Phi(x_2) - \Phi_{x_1}(x_1 - x_2)\|_2^2 \le C\|x_1 - x_2\|_2^2, \tag{20}$$

with constants satisfying $\beta \le 1/\mu < 1.5\alpha - 4C$, then the algorithm converges to
a solution $x^\star$ that satisfies

$$\|x - x^\star\| \le c\|e_{\mathcal{A}}\| + \|x_{\mathcal{A}} - x\|, \tag{21}$$
where $e_{\mathcal{A}} = y - \Phi(x_{\mathcal{A}})$ and $c$ is a constant depending only on $\alpha$, $\mu$ and $C$.
2.3 Example 

Before we prove Theorem 1 and Corollary 2, let us give a simple example that
shows how the above method and theory can be applied in a particular setting.
Assume we have constructed a Compressed Sensing system, where a sparse signal
$x \in \mathbb{R}^N$ is measured using a linear measurement system $\Phi$. Also assume that we
have constructed the system so that $\Phi$ satisfies the Restricted Isometry Property
with constants $\alpha$ and $\beta$. Now unfortunately, the sensors we have available for
the actual measurements are not exactly linear but have a slight non-linearity,
so that our measurements are of the form

$$y = \Phi f(x) + e, \tag{22}$$

where $f(\cdot)$ is a non-linear function applied to each element of the vector x. For
simplicity, we will write $f(x) = x + h(x)$, where again $h(x)$ is a function applied
element wise. We then have $f'(x) = 1 + h'(x)$.

It is not difficult to see that the Jacobian of $\Phi f(x)$ can be written as

$$\Phi_{x^\star} = \Phi + \Phi H'_{x^\star}, \tag{23}$$

where $H'_{x^\star}$ is the diagonal matrix with the elements $h'(x^\star)$ along the diagonal.
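Expression (23) is easy to confirm with finite differences. The sketch below does so for an assumed non-linearity $h(x) = 0.1\tanh(x)$ and a small random matrix; both choices, and the function names, are illustrative assumptions.

```python
import numpy as np

def h(x):
    # Assumed element-wise non-linearity.
    return 0.1 * np.tanh(x)

def h_prime(x):
    # Derivative of h: 0.1 / cosh(x)^2.
    return 0.1 / np.cosh(x) ** 2

def forward(phi, x):
    """The measurement map phi f(x) with f(x) = x + h(x)."""
    return phi @ (x + h(x))

def jacobian(phi, x_star):
    """Equation (23): phi + phi H', with H' = diag(h'(x_star))."""
    return phi + phi @ np.diag(h_prime(x_star))

rng = np.random.default_rng(3)
phi = rng.standard_normal((6, 10))
x_star = rng.standard_normal(10)

step = 1e-6
# Central differences, one coordinate direction per column.
numeric = np.column_stack([
    (forward(phi, x_star + step * e) - forward(phi, x_star - step * e)) / (2 * step)
    for e in np.eye(10)
])
max_dev = np.max(np.abs(numeric - jacobian(phi, x_star)))
print(max_dev)
```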

To use Corollary 2 we thus need to a) determine the RIP constants of $\Phi_{x^\star}$
and b) bound $\|\Phi(x_1) - \Phi(x_2) - \Phi_{x_1}(x_1 - x_2)\|_2^2$ as a function of $\|x_1 - x_2\|_2^2$.

The RIP constants are bounded for our example as follows. Assume that
$|h'(x)| \le M < 1$, that $x_1, x_2 \in \mathcal{A}$ and that $\Phi$ satisfies the RIP with constants $\alpha$
and $\beta$. We then have

$$\begin{aligned}
(\alpha^{1/2} - \beta^{1/2}M)\|x_1 - x_2\|
&\le \|\Phi(x_1 - x_2)\| - \|\Phi H'_{x^\star}(x_1 - x_2)\| \\
&\le \|\Phi_{x^\star}(x_1 - x_2)\| \\
&= \|\Phi(x_1 - x_2) + \Phi H'_{x^\star}(x_1 - x_2)\| \\
&\le \|\Phi(x_1 - x_2)\| + \|\Phi H'_{x^\star}(x_1 - x_2)\| \\
&\le \beta^{1/2}\|x_1 - x_2\| + \beta^{1/2}\|H'_{x^\star}(x_1 - x_2)\| \\
&\le \beta^{1/2}(1 + M)\|x_1 - x_2\|,
\end{aligned} \tag{24}$$

which proves the following lemma.

Lemma 3. Let $\Phi(x) = \Phi f(x)$, where the function $f(x) = x + h(x)$ is applied
element wise and where the derivative $h'(x)$ is absolutely bounded, $|h'(\cdot)| \le M$.
Also assume that the matrix $\Phi$ satisfies the RIP condition with constants $\alpha$
and $\beta$ for a set $\mathcal{A}$, then the matrix $\Phi(I + H'_{x^\star})$ satisfies the RIP with constants
$(\alpha^{1/2} - \beta^{1/2}M)^2$ and $\beta(1 + M)^2$.
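The chain of inequalities in (24) can be checked numerically in a simplified setting. The sketch below assumes that the constraint set is the whole space, so that $\alpha^{1/2}$ and $\beta^{1/2}$ are simply the smallest and largest singular values of a tall matrix; this is an assumption made only to keep the check exact, and the restricted constants of the lemma would be tighter. The diagonal entries of $H'$ are drawn at random within $[-M, M]$.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, big_m = 40, 10, 0.2          # big_m plays the role of M in |h'| <= M
phi = rng.standard_normal((m, n)) / np.sqrt(m)

# On the whole space the isometry constants are the extreme singular values.
singular_values = np.linalg.svd(phi, compute_uv=False)
sqrt_beta, sqrt_alpha = singular_values[0], singular_values[-1]

violations = 0
for _ in range(1000):
    h_diag = rng.uniform(-big_m, big_m, size=n)    # diagonal of H'
    v = rng.standard_normal(n)                     # stands in for x1 - x2
    value = np.linalg.norm(phi @ ((1.0 + h_diag) * v))
    lower = (sqrt_alpha - sqrt_beta * big_m) * np.linalg.norm(v)
    upper = sqrt_beta * (1.0 + big_m) * np.linalg.norm(v)
    if not (lower - 1e-12 <= value <= upper + 1e-12):
        violations += 1
print(violations)
```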

Let us now turn to point b). We have the bound

$$\begin{aligned}
\|\Phi(x_1) - \Phi(x_2) - \Phi_{x_1}(x_1 - x_2)\|_2^2
&= \|\Phi h(x_1) - \Phi h(x_2) - \Phi H'_{x_1}(x_1 - x_2)\|_2^2 \\
&= \|\Phi(h(x_1) - h(x_2) - H'_{x_1}x_1 + H'_{x_1}x_2)\|_2^2 \\
&\le \beta\|h(x_1) - h(x_2) - H'_{x_1}x_1 + H'_{x_1}x_2\|_2^2,
\end{aligned} \tag{25}$$

where in the last inequality we assume $\Phi$ to satisfy the RIP and that
$h(0) = 0$ (note that if we do not assume that $h(0) = 0$, then the same results still
hold, though we have to replace $\beta$ by the operator norm of $\Phi$). Let us introduce
the function $d_{x^\star}(x) = h(x) - H'_{x^\star}x$, so that

$$\|\Phi(x_1) - \Phi(x_2) - \Phi_{x_1}(x_1 - x_2)\|_2^2 \le \beta\|d_{x_1}(x_1) - d_{x_1}(x_2)\|_2^2. \tag{26}$$

Thus if $d_{x^\star}$ is Lipschitz for all $x^\star \in \mathcal{A}$ with a small constant K, then the condition

$$\|\Phi(x_1) - \Phi(x_2) - \Phi_{x_1}(x_1 - x_2)\|_2^2 \le C\|x_1 - x_2\|_2^2, \tag{27}$$

in Corollary 2 holds with $C = \beta K$.

Thus it remains to show that $d_{x_1}(x)$ is Lipschitz. If the Jacobian $D_{x_1}(x)$
of $d_{x_1}(x)$ satisfies $\|D_{x_1}(x + th)\| \le M$ for all $0 \le t \le 1$, then we know that

$$\|d_{x_1}(x + h) - d_{x_1}(x)\| \le M\|h\|, \tag{28}$$

so that

$$\|d_{x_1}(x_1) - d_{x_1}(x_2)\| = \|d_{x_1}(x_2 + x_1 - x_2) - d_{x_1}(x_2)\| \le M\|x_1 - x_2\|, \tag{29}$$

holds if

$$\|D_{x_1}(x_2 + t(x_1 - x_2))\| \le M \tag{30}$$

for all $0 \le t \le 1$.

For our simple example, we see that $D_{x_1}$ is in fact a diagonal matrix with
entries $\{D_{x_1}(x^\star)\}_{ii} = h'_i(x^\star) - h'_i(x_1)$, where $h'_i(x_1) = h'(\{x_1\}_i)$, so that

$$\{D_{x_1}(x_2 + t(x_1 - x_2))\}_{ii} = h'(\{x_2 + t(x_1 - x_2)\}_i) - h'(\{x_1\}_i). \tag{31}$$

Thus if $|h'(\cdot)|$ is bounded, that is, if $|h'(\cdot)| \le M$, then $\|D_{x_1}(x)\| \le M$.
We thus have demonstrated the following.

Lemma 4. Let $\Phi(x) = \Phi f(x)$, where the function $f(x) = x + h(x)$ is applied
element wise and where the derivative $h'(x)$ is absolutely bounded, $|h'(\cdot)| \le M$,
then

$$\|\Phi(x_1) - \Phi(x_2) - \Phi_{x_1}(x_1 - x_2)\|_2^2 \le C\|x_1 - x_2\|_2^2, \tag{32}$$

where $C = \beta M$.

2.4 Proof of Theorem 1

The proof basically follows that in [14], but with some important modifications
to account for the non-linear setting analysed here.

Proof. As always, we start with the triangle inequality

$$\|x - x^{n+1}\| \le \|x_{\mathcal{A}} - x^{n+1}\| + \|x_{\mathcal{A}} - x\| \tag{33}$$

and then bound the first term on the right using the definition

$$e_{\mathcal{A}}^n = y - \Phi(x^n) - \Phi_{x^n}(x_{\mathcal{A}} - x^n) \tag{34}$$



10 T. BLUMENSATH 



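and the inequalities

$$\begin{aligned}
\|x_{\mathcal{A}} - x^{n+1}\|^2
&\le \frac{1}{\alpha}\|\Phi_{x^n}(x_{\mathcal{A}} - x^{n+1})\|^2 \\
&= \frac{1}{\alpha}\|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n) - (y - \Phi(x^n) - \Phi_{x^n}(x_{\mathcal{A}} - x^n))\|^2 \\
&= \frac{1}{\alpha}\|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n) - e_{\mathcal{A}}^n\|^2 \\
&= \frac{1}{\alpha}\left(\|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\|^2 + \|e_{\mathcal{A}}^n\|^2 - 2\langle e_{\mathcal{A}}^n, (y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n))\rangle\right) \\
&\le \frac{1}{\alpha}\left(\|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\|^2 + \|e_{\mathcal{A}}^n\|^2 + \|e_{\mathcal{A}}^n\|^2 + \|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\|^2\right) \\
&= \frac{2}{\alpha}\|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\|^2 + \frac{2}{\alpha}\|e_{\mathcal{A}}^n\|^2.
\end{aligned} \tag{35}$$

We here used the fact that

$$\begin{aligned}
-\langle e_{\mathcal{A}}^n, (y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n))\rangle
&\le \|e_{\mathcal{A}}^n\|\,\|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\| \\
&\le 0.5\left(\|e_{\mathcal{A}}^n\|^2 + \|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\|^2\right).
\end{aligned}$$

The first term in the last line of (35) is bounded by the inequality

$$\|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\|^2 \le \left(\frac{1}{\mu} - \alpha\right)\|x_{\mathcal{A}} - x^n\|^2 + \|e_{\mathcal{A}}^n\|^2, \tag{36}$$

which is a result of the following argument, in which we use $g = 2\Phi_{x^n}^*(y - \Phi(x^n))$: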



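$$\begin{aligned}
&\|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\|^2 - \|y - \Phi(x^n)\|^2 \\
&\quad= -\langle (x^{n+1} - x^n), g\rangle + \|\Phi_{x^n}(x^{n+1} - x^n)\|^2 \\
&\quad\le -\langle (x^{n+1} - x^n), g\rangle + \frac{1}{\mu}\|x^{n+1} - x^n\|^2 \\
&\quad= \frac{1}{\mu}\left[\|x^{n+1} - x^n - \frac{\mu}{2}g\|^2 - \frac{\mu^2}{4}\|g\|^2\right] \\
&\quad= \frac{1}{\mu}\left[\inf_{x \in \mathcal{A}}\|x - x^n - \frac{\mu}{2}g\|^2 - \frac{\mu^2}{4}\|g\|^2\right] \\
&\quad= \inf_{x \in \mathcal{A}}\left[-\langle (x - x^n), g\rangle + \frac{1}{\mu}\|x - x^n\|^2\right] \\
&\quad\le -\langle (x_{\mathcal{A}} - x^n), g\rangle + \frac{1}{\mu}\|x_{\mathcal{A}} - x^n\|^2 \\
&\quad= -2\langle (x_{\mathcal{A}} - x^n), \Phi_{x^n}^*(y - \Phi(x^n))\rangle + \frac{1}{\mu}\|x_{\mathcal{A}} - x^n\|^2 \\
&\quad= -2\langle (x_{\mathcal{A}} - x^n), \Phi_{x^n}^*(y - \Phi(x^n))\rangle + \alpha\|x_{\mathcal{A}} - x^n\|^2 + \left(\frac{1}{\mu} - \alpha\right)\|x_{\mathcal{A}} - x^n\|^2 \\
&\quad\le -2\langle \Phi_{x^n}(x_{\mathcal{A}} - x^n), (y - \Phi(x^n))\rangle + \|\Phi_{x^n}(x_{\mathcal{A}} - x^n)\|^2 + \left(\frac{1}{\mu} - \alpha\right)\|x_{\mathcal{A}} - x^n\|^2 \\
&\quad= \|y - \Phi(x^n) - \Phi_{x^n}(x_{\mathcal{A}} - x^n)\|^2 - \|y - \Phi(x^n)\|^2 + \left(\frac{1}{\mu} - \alpha\right)\|x_{\mathcal{A}} - x^n\|^2,
\end{aligned} \tag{37}$$

where the first and last inequalities are due to the RIP of $\Phi_{x^n}$ and the
choice of $\beta \le 1/\mu$, whilst the second inequality is due to the fact that $x_{\mathcal{A}} \in \mathcal{A}$.
Adding $\|y - \Phi(x^n)\|^2$ to both sides and using the definition (34) gives (36).
We have thus shown that

$$\|x_{\mathcal{A}} - x^{n+1}\|^2 \le 2\left(\frac{1}{\mu\alpha} - 1\right)\|x_{\mathcal{A}} - x^n\|^2 + \frac{4}{\alpha}\|e_{\mathcal{A}}^n\|^2. \tag{38}$$

We can now iterate the above expression. Using $\epsilon^k = b\sum_{n=0}^{k-1} a^{k-1-n}\|e_{\mathcal{A}}^n\|^2$,
where $b = 4/\alpha$ and $a = 2/(\mu\alpha) - 2$, and starting from $x^0 = 0$, we get

$$\|x_{\mathcal{A}} - x^k\|^2 \le \left(2\left(\frac{1}{\mu\alpha} - 1\right)\right)^k\|x_{\mathcal{A}}\|^2 + \epsilon^k. \tag{39}$$

Thus

$$\begin{aligned}
\|x - x^k\| &\le \sqrt{c^k\|x_{\mathcal{A}}\|^2 + \epsilon^k} + \|x_{\mathcal{A}} - x\| \\
&\le c^{k/2}\|x_{\mathcal{A}}\| + \sqrt{\epsilon^k} + \|x_{\mathcal{A}} - x\|,
\end{aligned}$$

where $c = 2/(\mu\alpha) - 2$ and the theorem is proven. □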



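2.5 Proof of Corollary 2

Proof. Let us start with the bound in (38),

$$\|x_{\mathcal{A}} - x^{n+1}\|^2 \le 2\left(\frac{1}{\mu\alpha} - 1\right)\|x_{\mathcal{A}} - x^n\|^2 + \frac{4}{\alpha}\|e_{\mathcal{A}}^n\|^2, \tag{40}$$

and let us look a bit more closely at

$$\begin{aligned}
e_{\mathcal{A}}^n &= y - \Phi(x^n) - \Phi_{x^n}(x_{\mathcal{A}} - x^n) \\
&= \Phi(x_{\mathcal{A}}) + e_{\mathcal{A}} - \Phi(x^n) - \Phi_{x^n}(x_{\mathcal{A}} - x^n),
\end{aligned} \tag{41}$$

where $e_{\mathcal{A}} = y - \Phi(x_{\mathcal{A}})$. We then have

$$\|e_{\mathcal{A}}^n\| \le \|\Phi(x_{\mathcal{A}}) - \Phi(x^n) - \Phi_{x^n}(x_{\mathcal{A}} - x^n)\| + \|e_{\mathcal{A}}\|. \tag{42}$$

Now by assumption, $\|\Phi(x_{\mathcal{A}}) - \Phi(x^n) - \Phi_{x^n}(x_{\mathcal{A}} - x^n)\|$ is bounded as a function
of $\|x_{\mathcal{A}} - x^n\|$, i.e.

$$\|\Phi(x_{\mathcal{A}}) - \Phi(x^n) - \Phi_{x^n}(x_{\mathcal{A}} - x^n)\|^2 \le C\|x_{\mathcal{A}} - x^n\|^2, \tag{43}$$

so that, using $(u + v)^2 \le 2u^2 + 2v^2$, (38) becomes

$$\|x_{\mathcal{A}} - x^{n+1}\|^2 \le 2\left(\frac{1}{\mu\alpha} - 1 + \frac{4}{\alpha}C\right)\|x_{\mathcal{A}} - x^n\|^2 + \frac{8}{\alpha}\|e_{\mathcal{A}}\|^2. \tag{44}$$

Thus we require that $\frac{1}{\mu\alpha} - 1 + \frac{4}{\alpha}C < 0.5$, that is, that $1/\mu < 1.5\alpha - 4C$.

The same argument used in the main proof now holds. Whenever the constant
before the first term on the right hand side is smaller than one, we
can iterate the error and the corollary follows. □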



3 The Iterative Hard Thresholding Algorithm for Non-Linear Optimisation

Let us now return to the more general problem of minimising a non-linear function
$f(x)$ under the constraint that $x \in \mathcal{A}$, where $\mathcal{A}$ is a Union of Subspaces. Let
us recall again that for minimisation problems of the form $\arg\min_{x \in \mathcal{A}}\|y - \Phi x\|^2$
we use the algorithm

$$x^{n+1} = P_{\mathcal{A}}(x^n + \mu\Phi^*(y - \Phi x^n)). \tag{45}$$

Note that the update $\Phi^*(y - \Phi x^n)$ is a scaled version (by a factor of $-1/2$) of the gradient of the cost
function $\|y - \Phi x\|^2$.
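The relation between the update direction and the gradient can be verified numerically. In the sketch below the gradient of $\|y - \Phi x\|^2$, which is $-2\Phi^*(y - \Phi x)$ for real matrices, is compared against central finite differences; the random data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
phi = rng.standard_normal((8, 12))
y = rng.standard_normal(8)
x = rng.standard_normal(12)

def cost(x):
    """Sum of squares observation error ||y - phi x||^2."""
    return np.sum((y - phi @ x) ** 2)

analytic = -2.0 * (phi.T @ (y - phi @ x))     # gradient of the cost at x
step = 1e-6
numeric = np.array([
    (cost(x + step * e) - cost(x - step * e)) / (2 * step) for e in np.eye(12)
])
max_dev = np.max(np.abs(analytic - numeric))
print(max_dev)
```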

In the more general setting $\arg\min_{x \in \mathcal{A}} f(x)$, where x is a Euclidean vector,
we can simply replace this update direction with the gradient of $f(x)$ (evaluated
at $x^n$), whilst in more general spaces, we assume that $f(x)$ is Fréchet differentiable
with respect to x, that is, for each $x_1$ there exists a linear functional $D_{x_1}(\cdot)$
such that

$$\lim_{h \to 0}\frac{f(x_1 + h) - f(x_1) - D_{x_1}(h)}{\|h\|} = 0. \tag{46}$$

We can then use the Riesz representation theorem to write the linear functional
$D_{x_1}(\cdot)$ using its inner product equivalent

$$D_{x_1}(\cdot) = \langle\nabla(x_1), \cdot\rangle, \tag{47}$$

where $\nabla(x_1) \in \mathcal{H}$. Using $h = x_2 - x_1$ we see that for each $x_1$ we require
the existence of a $\nabla(x_1)$ such that

$$\lim_{x_2 \to x_1}\frac{f(x_2) - f(x_1) - \langle\nabla(x_1), (x_2 - x_1)\rangle}{\|x_2 - x_1\|_{\mathcal{H}}} = 0. \tag{48}$$

In Euclidean spaces the Fréchet derivative is obviously the differential of $f(x)$
at $x_1$, in which case $\nabla(x_1)$ is the gradient and $\langle\cdot, \cdot\rangle$ the Euclidean inner product.
With a slight abuse of terminology, we will therefore call $\nabla(x_1)$ 'the gradient'
even in more general Hilbert space settings.

Having thus defined an update direction $\nabla(x)$ in quite general spaces, we are
now in a position to define an algorithmic strategy to optimise $f(x)$. We again
use a version of our trusty Iterative Hard Thresholding algorithm, but replace
the update direction with $\nabla(x)$. With this modification, the algorithm might
also be called the Projected Landweber Algorithm [16], and is defined formally
by the iteration

$$x^{n+1} = P_{\mathcal{A}}(x^n - (\mu/2)\nabla(x^n)), \tag{49}$$

where $x^0 = 0$ and $\mu$ is a step size parameter chosen to satisfy the condition in
Theorem 5 below.
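Iteration (49) only needs a routine that evaluates the gradient. The sketch below implements it for the k-sparse model and, as a sanity check, confirms that for $f(x) = \|y - \Phi x\|^2$ it produces the same iterates as the linear update (45). The test problem, the step size and the function names are assumptions made for illustration.

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and set the rest to zero."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]
    out[keep] = x[keep]
    return out

def iht_gradient(grad, n, k, mu, iterations):
    """Iterate x <- P_A(x - (mu / 2) * grad(x)) starting from x = 0."""
    x = np.zeros(n)
    for _ in range(iterations):
        x = hard_threshold(x - 0.5 * mu * grad(x), k)
    return x

rng = np.random.default_rng(6)
m, n, k = 60, 120, 4                       # assumed problem sizes
phi = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
y = phi @ x_true
mu = 1.0 / np.linalg.norm(phi, 2) ** 2     # conservative step size

def grad(x):
    # Gradient of f(x) = ||y - phi x||^2.
    return -2.0 * (phi.T @ (y - phi @ x))

x_grad = iht_gradient(grad, n, k, mu, iterations=200)

# Reference: the linear update x <- P_A(x + mu * phi^T (y - phi x)).
x_lin = np.zeros(n)
for _ in range(200):
    x_lin = hard_threshold(x_lin + mu * (phi.T @ (y - phi @ x_lin)), k)
print(np.linalg.norm(x_grad - x_lin))
```

Other differentiable cost functions can be tried by swapping in a different `grad` callable, although whether the theory below applies depends on the Restricted Strict Convexity Property of the chosen function.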



3.1 Theoretical Performance Bound 

We now come to the second main result of this paper, which states that, if
$f(x)$ satisfies the Restricted Strict Convexity Property, then the Iterative Hard
Thresholding algorithm can find a vector x that is close to the true minimiser
of $f(x)$ among all $x \in \mathcal{A}$. In particular, we have the following theorem.

Theorem 5. Let $\mathcal{A}$ be a union of subspaces. Given the optimisation problem
$f(x)$, where $f(\cdot)$ is a positive function that satisfies the Restricted Strict Convexity
Property

$$\alpha \le \frac{f(x_1) - f(x_2) - \mathrm{Re}\langle\nabla(x_2), (x_1 - x_2)\rangle}{\|x_1 - x_2\|^2} \le \beta, \tag{50}$$

for all $x_1, x_2 \in \mathcal{H}$ for which $x_1 - x_2 \in \mathcal{A} + \mathcal{A} + \mathcal{A}$. Let $x_{opt} = \arg\min_{x \in \mathcal{A}} f(x)$
and assume that $\beta \le \frac{1}{\mu} < \frac{4}{3}\alpha$, then, after

$$n^\star = 2\,\frac{\ln\left(\delta\sqrt{f(x_{opt})}/\|x_{opt}\|\right)}{\ln(4(1 - \mu\alpha))} \tag{51}$$

iterations, the IHT algorithm calculates a solution $x^{n^\star}$ satisfying

$$\|x^{n^\star} - x\| \le \left(2\sqrt{\frac{\mu}{1 - c}} + \delta\right)\sqrt{f(x_{opt})} + \|x - x_{opt}\|, \tag{52}$$

where $c = 4(1 - \mu\alpha)$.

In the traditional Compressed Sensing setting, this result is basically that
derived in [7].
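To get a feel for the iteration count in Theorem 5, the sketch below evaluates the contraction factor $c = 4(1 - \mu\alpha)$ and the smallest integer n for which $c^{n/2}\|x_{opt}\| \le \delta\sqrt{f(x_{opt})}$. The numerical values of $\alpha$, $\beta$, $\mu$, $\delta$, $f(x_{opt})$ and $\|x_{opt}\|$ are assumptions chosen for illustration, and the helper names are hypothetical.

```python
import math

def theorem5_rate(alpha, beta, mu):
    """Return c = 4 (1 - mu * alpha) after checking beta <= 1/mu < (4/3) alpha."""
    if not (beta <= 1.0 / mu < 4.0 * alpha / 3.0):
        raise ValueError("step size violates beta <= 1/mu < (4/3) * alpha")
    return 4.0 * (1.0 - mu * alpha)

def iterations_needed(c, delta, f_opt, norm_x_opt):
    """Smallest integer n with c^(n/2) * ||x_opt|| <= delta * sqrt(f_opt).

    Assumes 0 < c < 1, f_opt > 0 and norm_x_opt > 0.
    """
    target = delta * math.sqrt(f_opt) / norm_x_opt
    if target >= 1.0:
        return 0
    return math.ceil(2.0 * math.log(target) / math.log(c))

c = theorem5_rate(alpha=0.9, beta=1.0, mu=1.0)
n_star = iterations_needed(c, delta=0.1, f_opt=0.01, norm_x_opt=1.0)
print(c, n_star)
```

For these assumed values c is about 0.4 and eleven iterations suffice; the count grows only logarithmically as $\delta$ shrinks.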

3.2 Proof of the Second Main Result 

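Proof of Theorem 5. The proof requires the orthogonal projection onto a subspace
$\Gamma$. The subspace $\Gamma$ is defined as follows. Let $\Gamma$ be the sum of no more than
three subspaces of $\mathcal{A}$, such that $x_{opt}, x^n, x^{n+1} \in \Gamma$. Let $P_\Gamma$ be the orthogonal
projection onto the subspace $\Gamma$. We write $a_\Gamma^n = P_\Gamma a^n$ and $P_\Gamma\nabla(x^n) = \nabla_\Gamma(x^n)$.
Note that this ensures that $P_\Gamma x^n = x^n$, $P_\Gamma x^{n+1} = x^{n+1}$ and $P_\Gamma x_{opt} = x_{opt}$.
We note for later that with this notation

$$\begin{aligned}
\mathrm{Re}\langle\nabla_\Gamma(x^n), (x_{opt} - x^n)\rangle &= \mathrm{Re}\langle P_\Gamma\nabla(x^n), (x_{opt} - x^n)\rangle \\
&= \mathrm{Re}\langle\nabla(x^n), P_\Gamma(x_{opt} - x^n)\rangle \\
&= \mathrm{Re}\langle\nabla(x^n), (x_{opt} - x^n)\rangle
\end{aligned} \tag{53}$$

and

$$\begin{aligned}
\|\nabla_\Gamma(x^n)\|^2 &= \langle\nabla_\Gamma(x^n), \nabla_\Gamma(x^n)\rangle = \langle P_\Gamma\nabla(x^n), P_\Gamma\nabla(x^n)\rangle \\
&= \langle\nabla(x^n), P_\Gamma^* P_\Gamma\nabla(x^n)\rangle \\
&= \langle\nabla(x^n), \nabla_\Gamma(x^n)\rangle.
\end{aligned} \tag{54}$$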
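We also need the following lemma.

Lemma 6. Under the assumptions of the theorem,

$$\left\|\frac{\mu}{2}\nabla_\Gamma(x^n)\right\|^2 - \mu f(x^n) \le 0. \tag{55}$$

Proof. Using the Restricted Strict Convexity Property we have

$$\begin{aligned}
\left\|\frac{\mu}{2}\nabla_\Gamma(x^n)\right\|^2 &= -\frac{\mu}{2}\mathrm{Re}\left\langle\nabla(x^n), -\frac{\mu}{2}\nabla_\Gamma(x^n)\right\rangle \\
&\le \frac{\mu}{2}\beta\left\|\frac{\mu}{2}\nabla_\Gamma(x^n)\right\|^2 + \frac{\mu}{2}f(x^n) - \frac{\mu}{2}f\left(x^n - \frac{\mu}{2}\nabla_\Gamma(x^n)\right) \\
&\le \frac{\mu}{2}\beta\left\|\frac{\mu}{2}\nabla_\Gamma(x^n)\right\|^2 + \frac{\mu}{2}f(x^n).
\end{aligned} \tag{56}$$

Thus

$$(2 - \mu\beta)\left\|\frac{\mu}{2}\nabla_\Gamma(x^n)\right\|^2 \le \mu f(x^n), \tag{57}$$

which is the desired result as $\mu\beta \le 1$ by assumption. □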

To prove the theorem, we start by bounding the distance between the current
estimate $x^{n+1}$ and the optimal estimate $x_{opt}$. Let $a_\Gamma^n = x_\Gamma^n - (\mu/2)\nabla_\Gamma(x^n)$. Because
$x^{n+1}$ is the closest element in $\mathcal{A}$ to $a^n$, we have

$$\begin{aligned}
\|x^{n+1} - x_{opt}\|^2 &\le \left(\|x^{n+1} - a_\Gamma^n\| + \|a_\Gamma^n - x_{opt}\|\right)^2 \\
&\le 4\|a_\Gamma^n - x_{opt}\|^2 \\
&= 4\|x^n - (\mu/2)\nabla_\Gamma(x^n) - x_{opt}\|^2 \\
&= 4\|(\mu/2)\nabla_\Gamma(x^n) + (x_{opt} - x^n)\|^2 \\
&= \mu^2\|\nabla_\Gamma(x^n)\|^2 + 4\|x_{opt} - x^n\|^2 + 4\mu\,\mathrm{Re}\langle\nabla_\Gamma(x^n), (x_{opt} - x^n)\rangle \\
&= \mu^2\|\nabla_\Gamma(x^n)\|^2 + 4\|x_{opt} - x^n\|^2 + 4\mu\,\mathrm{Re}\langle\nabla(x^n), (x_{opt} - x^n)\rangle \\
&\le 4\|x_{opt} - x^n\|^2 + \mu^2\|\nabla_\Gamma(x^n)\|^2 + 4\mu\left[-\alpha\|x^n - x_{opt}\|^2 + f(x_{opt}) - f(x^n)\right] \\
&= 4(1 - \mu\alpha)\|x_{opt} - x^n\|^2 + 4\mu f(x_{opt}) + 4\left[\|(\mu/2)\nabla_\Gamma(x^n)\|^2 - \mu f(x^n)\right] \\
&\le 4(1 - \mu\alpha)\|x_{opt} - x^n\|^2 + 4\mu f(x_{opt}).
\end{aligned} \tag{58}$$

Here, the second to last inequality is the RSCP and the last inequality is due to
Lemma 6.

We have thus shown that

$$\|x^{n+1} - x_{opt}\|^2 \le 4(1 - \mu\alpha)\|x_{opt} - x^n\|^2 + 4\mu f(x_{opt}). \tag{59}$$

Thus, with $c = 4(1 - \mu\alpha)$,

$$\|x^n - x_{opt}\|^2 \le c^n\|x_{opt}\|^2 + \frac{4\mu}{1 - c}f(x_{opt}), \tag{60}$$

so that, if $\frac{1}{\mu} < \frac{4}{3}\alpha$, we have $c = 4(1 - \mu\alpha) < 1$, so that $c^n$ decreases with
n. Taking the square root on both sides and noting that for positive a and b,
$\sqrt{a^2 + b^2} \le a + b$,

$$\|x^n - x_{opt}\| \le c^{n/2}\|x_{opt}\| + 2\sqrt{\frac{\mu}{1 - c}f(x_{opt})}. \tag{61}$$

The theorem then follows using the triangle inequality

$$\begin{aligned}
\|x^n - x\| &\le \|x^n - x_{opt}\| + \|x - x_{opt}\| \\
&\le c^{n/2}\|x_{opt}\| + 2\sqrt{\frac{\mu}{1 - c}f(x_{opt})} + \|x - x_{opt}\|.
\end{aligned} \tag{62}$$

The iteration count is found by setting

$$c^{n/2}\|x_{opt}\| \le \delta\sqrt{f(x_{opt})}, \tag{63}$$

so that after

$$n = 2\,\frac{\ln\left(\delta\sqrt{f(x_{opt})}/\|x_{opt}\|\right)}{\ln c} \tag{64}$$

iterations

$$\|x^n - x\| \le \left(2\sqrt{\frac{\mu}{1 - c}} + \delta\right)\sqrt{f(x_{opt})} + \|x - x_{opt}\|. \tag{65}$$

□

3.3 When and Where is this Theory Applicable? 

Since we first derived the result here, it has been shown that properties such
as the Restricted Strict Convexity Property do indeed hold for certain non-linear
functions such as those encountered in certain logistic regression problems
[13]. These recent findings thus further strengthen the case for a detailed study
of non-convexly constrained non-linear problems and the derivation of novel
methodologies for their solution.

It may thus seem tempting to use this theory also in a non-linear Compressed
Sensing setting, where we would have $f(x) = \|y - \Phi(x)\|_B^2$, where $\|\cdot\|_B$ is
some Banach space norm and where $\Phi(\cdot)$ is some non-linear function¹. If this
$f(x)$ were to satisfy the Restricted Strict Convexity Property, then the theory in
the second part of this paper would indeed tell us how to solve the non-linear
Compressed Sensing problem.

Unfortunately, it is far from clear yet under which conditions on
$f(x) = \|y - \Phi(x)\|_B^2$ Restricted Strict Convexity type properties hold. Indeed, the
following lemma shows that such a condition cannot be fulfilled in general for
Hilbert spaces.

Lemma 7. Assume $B$ is a Hilbert space and assume $f(x)$ is convex on $\mathcal{A} + \mathcal{A}$
for all y (i.e. it satisfies the Restricted Strict Convexity Property), then $\Phi$ is
affine on all subspaces of $\mathcal{A} + \mathcal{A}$.

¹ This was indeed the setting proposed in [11].

Proof. The proof was suggested by an anonymous reviewer of the earlier version
of this manuscript [11] and uses contradiction. Assume $\Phi$ is not affine on some
subspace of $\mathcal{A} + \mathcal{A}$. Thus, there is a subspace $S = \mathcal{A}_i + \mathcal{A}_j$, and $x_n \in S$, such that
for $x = \sum_n \lambda_n x_n$, where $\sum_n \lambda_n = 1$ and $0 \le \lambda_n$, we have $\sum_n \lambda_n\Phi(x_n) - \Phi(x) \ne 0$.
Now by assumption of convexity on S, we have (using $y_n = \Phi(x_n)$ and
$\bar{y} = \Phi(x)$)

$$\begin{aligned}
0 &\le \sum_n \lambda_n\|y - \Phi(x_n)\|^2 - \|y - \Phi(x)\|^2 = \sum_n \lambda_n\|y - y_n\|^2 - \|y - \bar{y}\|^2 \\
&= 2\langle y, \bar{y} - \sum_n \lambda_n y_n\rangle + \sum_n \lambda_n\|y_n\|^2 - \|\bar{y}\|^2,
\end{aligned} \tag{66}$$

where the inequality is due to the assumption of convexity. But the above
inequality cannot hold for all y (it fails for example for a sufficiently large multiple of $-(\bar{y} - \sum_n \lambda_n y_n)$).
Thus $\Phi$ needs to be affine on the linear subsets of $\mathcal{A} + \mathcal{A}$. □
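The failure mode used in the proof can be seen in a one-dimensional toy example. The sketch below takes the assumed non-affine map $\Phi(t) = t^2$, the points $x_1 = 0$ and $x_2 = 2$ with equal weights, and evaluates the quantity in (66) for two observations y. The map, the points and the function names are illustrative assumptions only.

```python
def nonlinear_map(t):
    # Assumed scalar, non-affine example map.
    return t * t

def convexity_gap(y, points, weights):
    """sum_n w_n |y - map(x_n)|^2 - |y - map(sum_n w_n x_n)|^2, as in (66)."""
    x_bar = sum(w * p for w, p in zip(weights, points))
    mixed = sum(w * (y - nonlinear_map(p)) ** 2 for w, p in zip(weights, points))
    return mixed - (y - nonlinear_map(x_bar)) ** 2

points, weights = [0.0, 2.0], [0.5, 0.5]
gap_at_zero = convexity_gap(0.0, points, weights)   # 0.5 * 0 + 0.5 * 16 - 1
gap_at_ten = convexity_gap(10.0, points, weights)   # 50 + 18 - 81
print(gap_at_zero, gap_at_ten)
```

Here $\bar{y} - \sum_n \lambda_n y_n = 1 - 2 = -1$, so a large positive y is a multiple of $-(\bar{y} - \sum_n \lambda_n y_n)$ and drives the gap negative, whereas y = 0 leaves it positive.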

Whilst this implies that the property cannot hold in Hilbert spaces for non-affine
$\Phi$ and all y, it does not preclude the possibility that it could hold for
specific observations y. This would not allow us to build a general signal recovery
framework, but might still allow us to recover a subset of signals. Thus,
for the non-linear Compressed Sensing problem in Hilbert space, the Restricted
Isometry Property of the Jacobian of $\Phi(x)$ together with the ability to construct
a good linear approximation of $\Phi(x)$ seem to be the more suitable tools to study
recovery performance. Nevertheless, for certain other non-convexly constrained
non-linear optimisation problems, such as those addressed in [13], the Restricted
Strict Convexity Property might be the more appropriate framework. Whilst
there are many similarities between these requirements and they both boil down
to the same RIP in the linear setting, it remains to be seen what the
exact relationship is between these two measures in general non-linear problems.



4 Conclusions 

Compressed Sensing ideas can be developed in much more general settings than 
considered traditionally. We have shown previously [2] that sparsity is not the 
only structure that allows signals to be recovered and that the finite dimensional 
setting can be replaced with a much more general Hilbert space framework. In 
this paper we have made a further important generalisation and have introduced 
the concept of non-linear measurements into Compressed Sensing theory. Under
certain conditions, such as the requirement that the Jacobian of the measurement
system satisfies a Restricted Isometry Property, the Iterative Hard
Thresholding algorithm can be used to recover signals from a non-convex constraint
set with similar error bounds to those derived in Compressed Sensing.

In the second part of this paper we have then looked at the related and in
some sense more general setting of non-linear optimisation under non-convex
constraints. Here we have looked at the Restricted Strict Convexity Property as a
tool to study recovery performance and it was shown that this condition is
indeed sufficient for the Iterative Hard Thresholding algorithm to find points that are near
the optimal solution.